
Adds lm_eval to evaluations #282


Open

wants to merge 48 commits into main

Conversation


@bigximik bigximik commented Jun 2, 2025

✨ Description

Add lm_eval integration which enables running evaluation directly from an in-memory model. Currently, only data parallelism is supported.

Closes #199

🔍 Type of change

Select all that apply:

  • πŸ› Bug fix (non-breaking change that addresses a specific issue)
  • πŸš€ New feature (non-breaking change that adds functionality)
  • ⚠️ Breaking change (a change that could affect existing functionality)
  • πŸ“ˆ Performance improvement/optimization (improves speed, memory usage, or efficiency)
  • πŸ› οΈ Code refactor (non-functional changes that improve code readability, structure, etc.)
  • πŸ“¦ Dependency bump (updates dependencies, including Dockerfile or package changes)
  • πŸ“ Documentation change (updates documentation, including new content or typo fixes)
  • πŸ”§ Infrastructure/Build change (affects build process, CI/CD, or dependencies)

πŸ“ Changes

List the key changes introduced in this PR:

  1. Improved Weights & Biases (wandb) Logging
    The final log for each step now explicitly commits the data, ensuring that the step is immediately visible in the wandb interface. Previously, a step would only appear after logging began for the following step.

  2. Integration of lm_eval Evaluator
    Added support for lm_eval as a built-in evaluation method. This enables running standard language model benchmarks during or after training using the Evaluation Harness (see the sketch after this list).

  3. Extended Support for Distributed Primitives
    Introduced custom distributed utilities, adapted from PyTorch, that support the non-standard process groups required by the lm_eval integration.

  4. Added User Guide for Evaluations
    Included documentation explaining how to configure and use evaluators such as loss and lm_eval, both during training and in standalone evaluation mode.
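
To make the integration concrete, here is a minimal sketch of how an in-memory model is typically exposed to the Evaluation Harness through lm_eval's 0.4.x Python API. The class name, constructor arguments, and method bodies are illustrative assumptions; the actual Fast-LLM wrapper in this PR may be structured differently.

```python
# Hypothetical sketch only: not the Fast-LLM wrapper from this PR.
import lm_eval
from lm_eval.api.model import LM


class InMemoryModelWrapper(LM):
    """Adapter exposing an already-loaded model to the Evaluation Harness."""

    def __init__(self, model, tokenizer):
        super().__init__()
        self._model = model
        self._tokenizer = tokenizer

    def loglikelihood(self, requests):
        # Return (log-likelihood, is_greedy) per request; see the
        # _loglikelihood_tokens discussion further down in this thread.
        raise NotImplementedError

    def loglikelihood_rolling(self, requests):
        raise NotImplementedError

    def generate_until(self, requests):
        raise NotImplementedError


# Benchmarks then run through the standard entry point, e.g.:
# results = lm_eval.simple_evaluate(
#     model=InMemoryModelWrapper(model, tokenizer),
#     tasks=["wikitext"],
# )
```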

✅ Checklist

Make sure the following tasks are completed before submitting the PR:

General

  • 📜 I have read and followed the contributing guidelines.
  • 🏷️ I am using a clear and descriptive PR title that summarizes the key change or feature introduced.
  • 🎉 The functionality is complete, and I have tested the changes.
  • 📝 I have updated the documentation if needed.
  • ⚠️ The change does not introduce any new issues (e.g., runtime warnings, type checker errors, linting problems, unhandled edge cases).
  • 🧩 I have commented my code, especially in hard-to-understand areas.

Testing

  • 🧪 I have added or updated tests to cover my changes.
  • ✔️ New and existing tests pass locally with my changes.
  • 🚦 I have tested these changes on GPUs and verified training stability.
  • 🏋️ I have tested the changes on realistic training workloads, if applicable.

Base automatically changed from denis/evaluate to main June 19, 2025 15:44
@bigximik bigximik changed the title from "[work in progress] Adds lm_eval to evaluations" to "Adds lm_eval to evaluations" on Jun 30, 2025
@bigximik bigximik marked this pull request as ready for review June 30, 2025 15:17
@bigximik bigximik requested a review from jlamypoirier June 30, 2025 15:17
Contributor Author

bigximik commented Jun 30, 2025

Log-likelihood calculation in the lm_eval wrapper is slow because post-processing of the logits is performed on the CPU (see the TODOs and NOTEs in the code). However, it is still faster than generation without a KV cache. I propose optimizing it in a separate PR once the KV cache is implemented.

" passed to the Fast-LLM lm_eval model wrapper.",
)

add_bos_token: bool = Field(
Collaborator

Could this be determined by the model itself?

Contributor Author

As I understand from comments across the code, some models simply underperform without it, but by default it is off for all models.

print(f"Determined largest batch size: {self.batch_sizes[sched]}")
return self.batch_sizes[sched]

def _loglikelihood_tokens(
Collaborator

What does this do?

Contributor Author

It computes per-token log-likelihoods, which are used by lm_eval to calculate various evaluation metrics. The function applies softmax to the returned logits to obtain token-level probabilities. Currently, the logits are moved to the CPU for post-processing, which can be slow but helps avoid GPU memory pressure.

There is a TODO to distribute this function and return only per-token log-probabilities from each shard instead of gathering all logits centrally. However, I prefer to implement that in a separate PR.
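
For context, the computation described above looks roughly like the following sketch. It is illustrative only, with assumed tensor shapes; the actual wrapper differs in batching and detail.

```python
import torch
import torch.nn.functional as F


def continuation_logprobs(logits: torch.Tensor, continuation: torch.Tensor) -> torch.Tensor:
    """Score a continuation given full-sequence logits (illustrative sketch).

    logits: [seq_len, vocab_size] outputs for context + continuation tokens.
    continuation: [cont_len] token ids whose log-likelihood is needed.
    """
    # Post-processing on CPU, as in the current wrapper: slower, but avoids
    # extra GPU memory pressure.
    log_probs = F.log_softmax(logits.float().cpu(), dim=-1)
    cont_len = continuation.shape[0]
    # logits[i] predicts token i + 1, so the predictions for the continuation
    # are the last cont_len positions before the final one.
    preds = log_probs[-cont_len - 1 : -1]
    return preds.gather(-1, continuation.cpu().long().unsqueeze(-1)).squeeze(-1)
```

Summing the returned values gives the sequence-level log-likelihood that lm_eval consumes for its metrics.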

# Note: metrics modified in-place
if self._wandb is not None:
    import wandb

-   wandb.log(metrics, step=completed_steps)  # noqa
+   wandb.log(metrics, step=completed_steps, commit=commit)  # noqa
Collaborator

Why do we want this?

Contributor Author

This approach allows the code to log multiple times for the same step. When the final log for that step is recorded, all previous logs for that step become immediately visible in Weights & Biases (wandb). There's no longer a need to wait for the next step to start logging for the previous step's logs to appear. Previously, logs for a step would remain hidden until logging began for the next step.
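
As a minimal illustration of this pattern with wandb's standard API (metric names and values here are made up):

```python
import wandb

run = wandb.init(project="demo")  # illustrative setup
completed_steps = 10

# Intermediate logs for a step are buffered with commit=False ...
wandb.log({"train/loss": 0.42}, step=completed_steps, commit=False)
wandb.log({"eval/lm_eval/wikitext": 18.7}, step=completed_steps, commit=False)

# ... and the final log for the step commits it, so everything logged for
# this step shows up in the wandb UI right away instead of only once the
# next step starts logging.
wandb.log({"train/learning_rate": 3e-4}, step=completed_steps, commit=True)
```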

@bigximik bigximik requested a review from tscholak July 8, 2025 10:06
Collaborator

@jlamypoirier jlamypoirier left a comment

It's getting close, but the lack of tests will be a problem. Do you intend to add some?

Collaborator

@jlamypoirier jlamypoirier left a comment

The main code looks ready, but I have some concerns with the tests. If we want to merge now I suggest moving the test changes to another PR, then I can approve right away.

]


@pytest.mark.extra_slow
Collaborator

How long does this take? It would be worrying not to have any tests other than extra-slow.

Contributor Author

Very long: 40-80 seconds per test.



@pytest.fixture(scope="function")
def get_lm_eval_config(tokenizer_path, monkeypatch):
Collaborator

I managed to reduce the test time by trimming bloat in lm_eval, reducing the evaluation size, and restricting the task to wikitext (the others are much slower). That should still be enough since we're only testing Fast-LLM's wrapper, not lm_eval itself. It's now below 10 seconds, so we no longer need to mark it as extra_slow.

Contributor Author

cool, thanks
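
For reference, restricting the benchmark to a single fast task and a small sample count looks roughly like this in lm_eval's own Python API. This is a sketch, not the actual test fixture from this PR; the model, task, and limit values are illustrative.

```python
import lm_eval

# A single small task plus a low sample limit keeps wrapper-level tests fast;
# "wikitext", limit=8, and the tiny HF model are illustrative choices.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-14m",
    tasks=["wikitext"],
    limit=8,
)
print(results["results"]["wikitext"])
```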

run_test_script_for_all_models(
get_lm_eval_config(run_test_script_base_path / "test_lm_eval_in_training"),
compare="test_lm_eval_in_training",
prepare_fn=_prepare_resume_fn,
Collaborator

This doesn't make sense: we're starting an evaluation, not resuming training, so we should use the checkpoint as a pretrained model.

Contributor Author

Okay, I created a function that simply copies the training run with a checkpoint, but the evaluation still starts from the training config. I haven't tested it starting from a pretrained checkpoint yet; maybe we should leave that for another PR?

cartesia_pytorch>=0.0.2

GENERATION =
lm_eval>=0.4.9
Collaborator

Added a dependency; packages we use should be declared in setup.

Contributor Author

Maybe the section should be named 'evaluations', since it's not specifically for generation?

@@ -392,3 +394,23 @@ def enabled(self) -> bool:
@property
def interrupted(self):
return self._interrupted


def set_global_variables(disable_torch_dynamo: bool = False) -> None:
Collaborator

Made this into a separate method so we can call it in more places, e.g. conftest and CLI code paths that don't go through Run.
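
A rough sketch of what such a helper might look like; the environment variables below are assumptions chosen for illustration, not the PR's actual implementation.

```python
import os


def set_global_variables(disable_torch_dynamo: bool = False) -> None:
    """Illustrative sketch of a process-wide setup hook (not the PR's code)."""
    # Example of a global knob commonly set before heavy imports (assumption).
    os.environ.setdefault("TOKENIZERS_PARALLELISM", "false")
    if disable_torch_dynamo:
        # Recognized by PyTorch; turns TorchDynamo / torch.compile into a no-op.
        os.environ["TORCHDYNAMO_DISABLE"] = "1"
```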

Collaborator

@jlamypoirier jlamypoirier left a comment

Should be ready to merge once the remaining comment is addressed and the tests pass.

Development

Successfully merging this pull request may close these issues.

Run lm-eval-harness benchmarks during validation